XML for the absolute beginner
A guided tour from HTML to processing XML with Java

HTML: All form and no substance
HTML is a language designed to "talk about" documents:
headings, titles, captions, fonts, and so on. It's heavily document
structure- and presentation-oriented.
Admittedly, artists and hackers have been able to work miracles with
the relatively dull tool called HTML. But HTML has serious drawbacks that
make it a poor fit for designing flexible, powerful, evolutionary
information systems. Here are a few of the major complaints:
- HTML isn't extensible
An extensible markup language would allow application developers to define custom tags for
application-specific situations. Unless you're a 600-pound gorilla (and
maybe not even then) you can't require all browser manufacturers to
implement all the markup tags necessary for your application. So, you're
stuck with what the big browser makers, or the W3C (World Wide Web
Consortium) will let you have. What we need is a language that allows us
to make up our own markup tags without having to call the browser
manufacturer.
- HTML is very display-centric
HTML is a fine language for display purposes, unless you require a lot of precise
formatting or transformation control (in which case it stinks). HTML
represents a mixture of document logical structure (titles, paragraphs,
and such) with presentation tags (bold, image alignment, and so on).
Since almost all of the HTML tags have to do with how to display
information in a browser, HTML is useless for other common network
applications -- like data replication or application services. We need a
way to unify these common functions with display, so the same server
used to browse data can also, for example, perform enterprise business
functions and interoperate with legacy systems.
- HTML isn't usually directly reusable
Creating documents in word processors and then exporting them as HTML is somewhat
automated but still requires, at the very least, some tweaking of the
output in order to achieve acceptable results. If the data from which
the document was produced change, the entire HTML translation needs to
be redone. By contrast, Web sites that show the current weather around the
globe, around the clock, usually handle this kind of automatic reformatting
very well.
The content and the presentation style of the document are separated,
because the system designers understand that their content (the
temperatures, forecasts, and so on) changes constantly. What we
need is a way to specify data presentation in terms of structure, so
that when data are updated, the formatting can be "reapplied"
consistently and easily.
- HTML only provides one 'view' of data
It's difficult to write HTML that displays the same data in different ways
based on user requests. Dynamic HTML is a start, but it requires an
enormous amount of scripting and isn't a general solution to this
problem. (Dynamic HTML is discussed in more detail below.) What we need
is a way to get all the information we may want to browse at once, and
look at it in various ways on the client.
- HTML has little or no semantic structure
Most Web applications would benefit from an ability to represent data by
meaning rather than by layout. For example, it can be very difficult to
find what you're looking for on the Internet, because there's no
indication of the meaning of the data in HTML files (aside from META
tags, which are usually misleading). Type "red" into a search
engine, and you'll get links to Red Skelton, red herring, red snapper,
the red scare, Red Letter Day, and probably a page or two of "Books I've
Red." HTML has no way to specify what a particular page item means. A
more useful markup language would represent information in terms of its
meaning. What we need is a language that tells us not how to
display information, but rather, what a given block of
information is so we know what to do with it.
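To make the last two complaints concrete, here is a minimal sketch in Java of what markup by meaning buys you. Everything specific in it is an assumption for illustration: the tag names (forecast, city, temperature), the sample values, and the class name WeatherViews are all made up, and the parser is simply the one built into the standard JDK. The point is only that once the data says what it is, a program can find a value by name and produce more than one view of it.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

// Hypothetical example: tag names and values are invented for illustration.
class WeatherViews {

    // One fact, marked up by meaning rather than by layout.
    static final String XML =
        "<forecast>"
      + "<city>Granada</city>"
      + "<temperature unit=\"C\">21</temperature>"
      + "</forecast>";

    // Parse an XML string into a DOM tree with the JDK's built-in parser.
    static Document parse(String xml) throws Exception {
        DocumentBuilder builder =
            DocumentBuilderFactory.newInstance().newDocumentBuilder();
        return builder.parse(new InputSource(new StringReader(xml)));
    }

    // Return the text of the first element with the given tag name.
    static String textOf(Document doc, String tag) {
        return doc.getElementsByTagName(tag).item(0).getTextContent();
    }

    // View 1: a plain sentence, including the unit attribute.
    static String asSentence(Document doc) {
        Element temp =
            (Element) doc.getElementsByTagName("temperature").item(0);
        return textOf(doc, "city") + ": " + temp.getTextContent()
            + " degrees " + temp.getAttribute("unit");
    }

    // View 2: the same data rendered as an HTML table row.
    static String asHtmlRow(Document doc) {
        return "<tr><td>" + textOf(doc, "city") + "</td><td>"
            + textOf(doc, "temperature") + "</td></tr>";
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse(XML);
        System.out.println(asSentence(doc)); // Granada: 21 degrees C
        System.out.println(asHtmlRow(doc));  // <tr><td>Granada</td><td>21</td></tr>
    }
}
```

Notice that the HTML row is just one of the possible outputs. Going the other way, from the table row back to "this number is a temperature in Celsius," is not possible without guessing, which is the complaint in a nutshell.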
SGML has none of these weaknesses, but in order to be general, it's
hair-tearingly complex (at least in its complete form). The language used
to format SGML (its "style language"), called DSSSL (Document Style
Semantics and Specification Language), is extremely powerful but difficult
to use. How do we get a language that's roughly as easy to use as HTML but
has most of the power of SGML?
Origins of XML
As the Web
exploded in popularity and people all over the world began learning about
HTML, they fairly quickly started running into the limitations outlined
above. Heavy-metal SGML wonks, who had been working with SGML for years in
relative obscurity, suddenly found that everyday people had some
understanding of the concept of markup (that is, HTML). SGML experts began
to consider the possibility of using SGML on the Web directly, instead of
using just one application of it (again, HTML). At the same time, they
knew that SGML, while powerful, was simply too complex for most people to
use.
In the summer of 1996, Jon Bosak (currently online information
technology architect at Sun Microsystems) convinced the W3C to let him
form a committee on using SGML on the Web. He created a high-powered team
of muckety-mucks from the SGML world. By November of that year, these
folks had created the beginnings of a simplified form of SGML that
incorporated tried-and-true features of SGML but with reduced complexity.
This was, and is, XML.
In March 1997, Bosak released his landmark paper, "XML, Java and the
Future of the Web" (see Resources).
Now, two years later (a very long time in the life of the Web), Bosak's
short paper is still a good, if dated, introduction to why using XML is
such an excellent idea.
SGML was created for general document structuring, and HTML was created
as an application of SGML for Web documents. XML is a simplification of
SGML for general Web use.
Resources
There are so
many XML resources on the Web, I've had to categorize. The first section
here is the most useful, since the documents are either high-level
summaries or excellent link sites. Apologies to anyone who was omitted.
General XML resources
- "XML, Java and the Future of the Web," Jon Bosak. The paper that
started it all, at least from a Java programmer's point of view.
Definitely worth a read, even if it's a bit dated. Jon is commonly
considered to be the father of XML. Funny how all of these technologies
seem to have paternity:
http://metalab.unc.edu/pub/sun-info/standards/xml/why/xmlapps.html
- "Media-Independent Publishing: Four Myths about XML," Jon Bosak:
http://metalab.unc.edu/pub/sun-info/standards/xml/why/4myths.htm
- Robin Cover's XML-SGML site is, according to my SGML buddies, the
bible of XML resources:
http://www.oasis-open.org/cover/
- The W3C's XML resource page lets you cheer from the sidelines as XML
technology proposals develop into recommendations, or join in the fray
on their active mailing lists:
http://www.w3.org/XML/
- OASIS, the Web site of the Organization for the Advancement of
Structured Information Standards, offers general news and information
about XML:
http://www.oasis-open.org/
- The Graphics Communications Association, host of the XTech '99
conference (March 11 to 13, 1999, San Jose, CA) and the upcoming XML
Europe '99 conference in Granada, Spain, (April 26 to 30, 1999) has a
Web site packed with XML information:
http://www.gca.org/
- XML.com is great for watching trends and digging up XML news:
http://www.xml.com/
- Textuality hosts Tim Bray's site. Check it out for a look at the
"big picture" of how XML fits into the structured document universe --
and for a look at Lark, Tim's nonvalidating XML processor:
http://www.textuality.com/
- The XML FAQ:
http://www.ucc.ie/xml/
- IBM's XML Website is an outstanding supplement to alphaWorks:
http://www.software.ibm.com/xml/index.html
XML and Java
- "XML and Java: The Perfect Pair" by Ken Sall (Internet.com, November
1998) provides information about XML, Java, and why these two are a
match made in heaven:
http://wdvl.com/Authoring/Languages/XML/Java/index.html
Tutorials and training
- Generally Markup, Richard Lander's Web site, may be of interest to
you if you haven't yet read enough about markup languages:
http://pdbeam.uwaterloo.ca/~rlander/
- The Mulberry Technologies Web site is a good resource for commercial
training in XML, as well as general XML and SGML consulting by seasoned
SGML experts:
http://www.mulberrytech.com/
- The Web Developer's Virtual Library Series on XML offers good
summaries of various XML technologies, as well as annotated indices of
XML software:
http://wdvl.com/Software/XML
- Microsoft's Site Builder Network provides a series of articles
called "Extreme XML," one of which appears in the following link. While
some of it focuses on Microsoft-only, Windows-only technology, there's
still some great stuff here:
http://www.microsoft.com/sitebuilder/magazine/xml.asp
- Webmonkey has a good series of articles introducing readers to XML.
The index is at:
http://www.hotwired.com/webmonkey/xml/?tw=xml
- "What the ?xml!" by L.C. Rees offers an interesting take on XML and
why it's necessary -- nicely written and entertaining to boot:
http://www.geocities.com/SiliconValley/Peaks/5957/wxml.html
- "The XML Revolution" by Dan Connolly is a quick backgrounder on XML
(Nature):
http://helix.nature.com/webmatters/xml.html
Cascading Style Sheets
- W3C's CSS page will get you started learning about CSS:
http://www.w3.org/Style/CSS/
- "Cascading Style Sheets: Designing for the Web" by Håkon Wium Lie and
Bert Bos (Addison-Wesley, 1997). Sample chapters from the book appear at:
http://www.awl.com/cseng/titles/0-201-41998-X/liebos/
Extensible Style Language (XSL)
- The W3C's XSL page:
http://www.w3.org/Style/XSL/
- Read (and comment on) the W3C's XSL Working Draft (currently dated
December 16, 1998):
http://www.w3.org/TR/WD-xsl
- "The Extensible Style Language: Styling XML Documents"
(WebTechniques Magazine). XSL tutorial information and examples:
http://www.webtechniques.com/features/1999/01/walsh/walsh.shtml
- Microsoft's XML and XSL tutorial site is especially interesting
because of the recent release of client-side XSL in Internet Explorer
5.0. Extensive and excellent:
http://www.microsoft.com/xml
- If you're still using IE 4.0, you can still experiment with XML,
using Microsoft's internal DOM:
http://www.microsoft.com/xml/articles/xmlmodel.asp
- If you want to experiment with XSL, try downloading IBM's LotusXSL.
It's all Java, and for the time being, it's free:
http://www.alphaworks.ibm.com/tech/LotusXSL
- Or, you can try James Clark's XT XSL engine, downloadable from:
http://www.jclark.com/xml/xt.html
Upcoming XSL contest
Though the details aren't yet worked out, Sun Microsystems will soon
announce a call for proposals for a $30,000 grant to develop a
client-side processor for full XSL implementation in Mozilla.
It will also announce, in conjunction with Adobe, a contest (first prize
$40,000, second prize $20,000) to develop a pure-Java, server-side
processor of the entire XSL language, to format XML to PDF (Adobe's
document format). Keep watching the Java Developer Connection (requires
free registration), and Mozilla sites for the eventual announcements.
- "XTech '99: Java and the XML wave" by Mark Johnson
(JavaWorld, April 1999) offers the most current information on
the contest:
http://www.javaworld.com/javaworld/jw-04-1999/jw-04-xtech.html
Simple API for XML (SAX)
- The definitive description of SAX is available online. You can also
download free SAX software here:
http://www.megginson.com/SAX/index.html
Document Object Model (DOM)
- The W3C information page for the Document Object Model appears on
the W3C site:
http://www.w3c.org/DOM/
- Among other things, you'll find the W3C Recommendation for DOM Level
1:
http://www.w3.org/TR/REC-DOM-Level-1/
- The Java bindings for DOM, for both XML and HTML, are in this
Recommendation appendix:
http://www.w3.org/TR/REC-DOM-Level-1/java-language-binding.html
- A great DOM tutorial by William Robert Stanek appears on PC
Magazine Online in "Object-Based Web Design." This tutorial
includes a discussion of using DOM with IDL, CORBA's Interface
Definition Language:
http://www8.zdnet.com/pcmag/pctech/content/17/13/tf1713.001.html
Dynamic HTML
- The Dynamic HTML Resource page contains several links to DHTML
articles:
http://www.hotwired.com/webmonkey/dynamic_html/?tw=dynamic_html
Software
- Epicentric, Inc.:
http://www.epicentric.com/
- More XML (and other Java) technology than you can shake a stick at
is available at IBM's alphaWorks:
http://alphaworks.ibm.com/
- Version 2 of IBM's excellent XML parser package, xml4j, is available
for download. This package includes several parsers, both validating and
nonvalidating:
http://www.alphaworks.ibm.com/tech/xml4j
- See also IBM's exciting Bean Markup Language project, which uses XML
to represent and manipulate JavaBeans:
http://www.alphaworks.ibm.com/tech/bml
- Another free Java XML parser was written by the indefatigable James
Clark. Download it at:
http://www.jclark.com/xml/xp/index.html
- XEENA is IBM alphaWorks's DTD-guided XML editor. You want it, you
need it, you gotta have it:
http://www.alphaworks.ibm.com/tech/xeena
- Mozilla.org is the open source community's effort to extend the
Netscape source code. Find out about it at:
http://www.mozilla.org/
- Information about XML and CSS in Mozilla appears at:
http://www.mozilla.org/rdf/doc/xml.html
- You can read about Sun's XML and Java initiatives at:
http://www.sun.com/990310/java_xml.jhtml
- In addition, Java Project X includes source code downloadable from:
http://developer.java.sun.com/developer/earlyAccess/xml/index.html
- ArborText has a suite of sophisticated tools for editing SGML, XML,
and XSL:
http://www.arbortext.com/Products/products.html
- Oracle8i from Oracle Corporation uses XML inside the Oracle core:
http://www.oracle.com/xml/
- Download Oracle's free XML for Java parser:
http://technet.oracle.com/direct/3xml.htm
- Microsoft's Internet Explorer 5.0, released this month, implements
part of the XSL spec. You can find it on Microsoft's Web site -- and
also just about anywhere else:
http://www.microsoft.com/windows/ie/default.htm
- You can also download a beta release of Microsoft's XML Notepad
editor (limited to running only on Microsoft Windows):
http://www.microsoft.com/xml/notepad/download.asp
- Vervet Logic of Bloomington, IN, has announced XML <PRO>, a
commercial XML editor:
http://www.vervet.com/
- Majix, to transform XML to HTML via XSL, is available at:
http://www.tetrasix.com/
- If your French is rusty, you might want to try the English-language
site at:
http://www.tetrasix.com/english/default.htm
History
- Read about the history of HTML here. It's part of an online book, so
there's no telling for how long it will be available:
http://ei.cs.vt.edu/~wwwbtb/hardcopy/book/chap4/origins.html
The two chapters listed below (of the book "HTML Unleashed" by Rick Darnell,
et al.) also cover some of the technical background of these languages.
- SGML history:
http://www.webreference.com/dlab/books/html/3-2.html
- XML history (such as it is):
http://www.webreference.com/dlab/books/html/38-0.html
- Nothing to do on Friday night? Why not read up on the history of
SGML? Charles Goldfarb, considered by many to be the "father of SGML,"
reminisces publicly at:
http://www.sgmlsource.com/Goldfarb/history/index.htm
- Useful XML and SGML information appears at Goldfarb's Web site,
including a comprehensive XML book list:
http://www.sgmlsource.com/
Miscellaneous links
- Uche Ogbuji has written an interesting article in
LinuxWorld about using XML on Linux in the Enterprise. It's at:
http://www.linuxworld.com/linuxworld/lw-1999-03/lw-03-xml.html
- Bluestone Software has recently made a splash with pure-Java XML
application servers, and a freely downloadable Swing package called
XwingML:
http://www.bluestone.com/
- Everyone (except Microsoft) is pretty freaked out about the US
Patent Office awarding Microsoft a patent for certain kinds of
functionality in style sheets. What happens with this patent, and its
impact on developing technology, remains to be seen. Judge for yourself
by reading the patent at:
http://www.patents.ibm.com/patlist?icnt=US&patent_number=5860073
- The title of the sample recipe is actually the title of a very funny
song by William Bolcom. Similar recipes may be found at:
http://www.b4uby.com/granny/gsoup.htm
- The song appears on a compact disc (with other odd songs) available
from the Public Radio Music Source at:
http://75music.org/best/docs/keepers.htm